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ABSTRACT 


Two decades ago, one of us co-authored a paper com- 
menting on the previous 40 years of data modelling re- 
search and development [188]. That paper demonstrated 
that the relational model (RM) and SQL are the prevail- 
ing choice for database management systems (DBMSs), 
despite efforts to replace either them. Instead, SQL ab- 
sorbed the best ideas from these alternative approaches. 

We revisit this issue and argue that this same evolu- 
tion has continued since 2005. Once again there have 
been repeated efforts to replace either SQL or the RM. 
But the RM continues to be the dominant data model 
and SQL has been extended to capture the good ideas 
from others. As such, we expect more of the same in 
the future, namely the continued evolution of SQL and 
relational DBMSs (RDBMSs). We also discuss DBMS 
implementations and argue that the major advancements 
have been in the RM systems, primarily driven by chang- 
ing hardware characteristics. 


1 Introduction 


In 2005, one of the authors participated in writing a 
chapter for the Red Book titled “What Goes Around 
Comes Around” [188]. That paper examined the major 
data modelling movements since the 1960s: 


Hierarchical (e.g., IMS): late 1960s and 1970s 
Network (e.g., CODASYL): 1970s 

Relational: 1970s and early 1980s 
Entity-Relationship: 1970s 

Extended Relational: 1980s 

Semantic: late 1970s and 1980s 

Object-Oriented: late 1980s and early 1990s 
Object-Relational: late 1980s and early 1990s 
Semi-structured (e.g., XML): late 1990s and 2000s 


Our conclusion was that the relational model with an 
extendable type system (i.e., object-relational) has dom- 
inated all comers, and nothing else has succeeded in 
the marketplace. Although many of the non-relational 
DBMSs covered in 2005 still exist today, their vendors 
have relegated them to legacy maintenance mode and 
nobody is building new applications on them. This per- 
sistence is more of a testament to the “stickiness” of data 
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rather than the lasting power of these systems. In other 
words, there still are many IBM IMS databases running 
today because it is expensive and risky to switch them 
to use a modern DBMS. But no start-up would willingly 
choose to build a new application on IMS. 


A lot has happened in the world of databases since our 
2005 survey. During this time, DBMSs have expanded 
from their roots in business data processing and are now 
used for almost every kind of data. This led to the “Big 
Data” era of the early 2010s and the current trend of inte- 
grating machine learning (ML) with DBMS technology. 


In this paper, we analyze the last 20 years of data 
model and query language activity in databases. We 
structure our commentary into the following areas: (1) 
MapReduce Systems, (2) Key-value Stores, (3) Docu- 
ment Databases, (4) Column Family / Wide-Column, 
(5) Text Search Engines, (6) Array Databases, (7) 
Vector Databases, and (8) Graph Databases. 


We contend that most systems that deviated from 
SQL or the RM have not dominated the DBMS land- 
scape and often only serve niche markets. Many sys- 
tems that started out rejecting the RM with much fanfare 
(think NoSQL) now expose a SQL-like interface for RM 
databases. Such systems are now on a path to conver- 
gence with RDBMSs. Meanwhile, SQL incorporated 
the best query language ideas to expand its support for 
modern applications and remain relevant. 


Although there has not been much change in RM 
fundamentals, there were dramatic changes in RM sys- 
tem implementations. The second part of this paper 
discusses advancements in DBMS architectures that ad- 
dress modern applications and hardware: (1) Columnar 
Systems, (2) Cloud Databases, (3) Data Lakes / Lake- 
houses, (4) NewSQL Systems, (5) Hardware Acceler- 
ators, and (6) Blockchain Databases. Some of these 
are profound changes to DBMS implementations, while 
others are merely trends based on faulty premises. 


We finish with a discussion of important considera- 
tions for the next generation of DBMSs and provide part- 
ing comments on our hope for the future of databases in 
both research and commercial settings. 
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2 Data Models & Query Languages 


For our discussion here, we group the research and de- 
velopment thrusts in data models and query languages 
for database into eight categories. 


2.1 MapReduce Systems 


Google constructed their MapReduce (MR) framework 
in 2003 as a “point solution” for processing its periodic 
crawl of the internet [122]. At the time, Google had 
little expertise in DBMS technology, and they built MR 
to meet their crawl needs. In database terms, Map is a 
user-defined function (UDF) that performs computation 
and/or filtering while Reduce is a GROUP BY operation. 
To a first approximation, MR runs a single query: 


SELECT map() FROM crawl_table GROUP BY reduce() 


Google’s MR approach did not prescribe a specific 
data model or query language. Rather, it was up to the 
Map and Reduce functions written in a procedural MR 
program to parse and decipher the contents of data files. 

There was a lot of interest in MR-based systems at 
other companies in the late 2000s. Yahoo! developed 
an open-source version of MR in 2005, called Hadoop. 
It ran on top of a distributed file system HDFS that was 
a clone of the Google File System [134]. Several start- 
ups were formed to support Hadoop in the commercial 
marketplace. We will use MR to refer to the Google 
implementation and Hadoop to refer to the open-source 
version. They are functionally similar. 

There was a controversy about the value of Hadoop 
compared to RDBMSs designed for OLAP workloads. 
This culminated in a 2009 study that showed that data 
warehouse DBMSs outperformed Hadoop [172]. This 
generated dueling articles from Google and the DBMS 
community [123, 190]. Google argued that with care- 
ful engineering, a MR system will beat DBMSs, and a 
user does not have to load data with a schema before 
running queries on it. Thus, MR is better for “one shot” 
tasks, such as text processing and ETL operations. The 
DBMS community argued that MR incurs performance 
problems due to its design that existing parallel DBMSs 
already solved. Furthermore, the use of higher-level 
languages (SQL) operating over partitioned tables has 
proven to be a good programming model [127]. 

A lot of the discussion in the two papers was on imple- 
mentation issues (e.g., indexing, parsing, push vs. pull 
query processing, failure recovery). From reading both 
papers a reasonable conclusion would be that there is a 
place for both kinds of systems. However, two changes 
in the technology world rendered the debate moot. 

The first event was that the Hadoop technology and 
services market cratered in the 2010s. Many enterprises 
spent a lot of money on Hadoop clusters, only to find 
there was little interest in this functionality. Developers 
found it difficult to shoehorn their application into the 
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restricted MR/Hadoop paradigm. There were consider- 
able efforts to provide a SQL and RM interface on top 
of Hadoop, most notable was Meta’s Hive [30, 197]. 

The next event occurred eight months after the CACM 
article when Google announced that they were moving 
their crawl processing from MR to BigTable [164]. The 
reason was that Google needed to interactively update 
its crawl database in real time but MR was a batch sys- 
tem. Google finally announced in 2014 that MR had no 
place in their technology stack and killed it off [194]. 

The first event left the three leading Hadoop vendors 
(Cloudera, Hortonworks, MapR) without a viable prod- 
uct to sell. Cloudera rebranded Hadoop to mean the 
whole stack (application, Hadoop, HDFS). In a further 
sleight-of-hand, Cloudera built a RDBMS, Impala [150], 
on top of HDFS but not using Hadoop. They realized 
that Hadoop had no place as an internal interface in a 
SQL DBMS, and they configured it out of their stack 
with software built directly on HDFS. In a similar vein, 
MapR built Drill [22] directly on HDFS, and Meta cre- 
ated Presto [185] to replace Hive. 


Discussion: MR’s deficiencies were so significant that 
it could not be saved despite the adoption and enthu- 
siasm from the developer community. Hadoop died 
about a decade ago, leaving a legacy of HDFS clusters 
in enterprises and a collection of companies dedicated 
to making money from them. At present, HDFS has 
lost its luster, as enterprises realize that there are better 
distributed storage alternatives [124]. Meanwhile, dis- 
tributed RDBMSs are thriving, especially in the cloud. 
Some aspects of MR system implementations related 
to scalability, elasticity, and fault tolerance are carried 
over into distributed RDBMSs. MR also brought about 
the revival of shared-disk architectures with disaggre- 
gated storage, subsequently giving rise to open-source 
file formats and data lakes (see Sec. 3.3). Hadoop’s lim- 
itations opened the door for other data processing plat- 
forms, namely Spark [201] and Flink [109]. Both sys- 
tems started as better implementations of MR with pro- 
cedural APIs but have since added support for SQL [105]. 


2.2 Key/Value Stores 


The key/value (KV) data model is the simplest model 
possible. It represents the following binary relation: 


(key, value) 


A KV DBMS represents a collection of data as an as- 
sociative array that maps a key to a value. The value is 
typically an untyped array of bytes (i.e., a blob), and the 
DBMS is unaware of its contents. It is up to the appli- 
cation to maintain the schema and parse the value into 
its corresponding parts. Most KV DBMSs only provide 
get/set/delete operations on a single value. 

In the 2000s, several new Internet companies built 
their own shared-nothing, distributed KV stores for nar- 
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rowly focused applications, like caching and storing ses- 
sion data. For caching, Memcached [131] is the most 
well-known example of this approach. Redis [67] mar- 
kets itself as a Memcached replacement, offering a more 
robust query API with checkpointing support. For more 
persistent application data, Amazon created the Dynamo 
KV store in 2007 [125]. Such systems offer higher and 
more predictable performance, compared to a RDBMS, 
in exchange for more limited functionality. 


The second KV DBMS category are embedded stor- 
age managers designed to run in the same address space 
as a higher-level application. One of the first stan- 
dalone embedded KV DBMSs was BerkeleyDB from 
the early 1990s [170]. Recent notable entries include 
Google’s LevelDB [37], which Meta later forked as 
RocksDB [68]. 


Discussion: Key/value stores provide a quick “out-of- 
the-box” way for developers to store data, compared to 
the more laborious effort required to set up a table in a 
RDBMS. Of course, it is dangerous to use a KV store 
in a complex application that requires more than just a 
binary relation. If an application requires multiple fields 
in a record, then KV stores are probably a bad idea. Not 
only must the application parse record fields, but also 
there are no secondary indexes to retrieve other fields 
by value. Likewise, developers must implement joins or 
multi-get operations in their application. 


To deal with these issues, several systems began as 
a KV store and then morphed into a more feature-rich 
record store. Such systems replace the opaque value 
with a semi-structured value, such as a JSON docu- 
ment. Examples of this transition are Amazon’s Dy- 
namoDB [129] and Aerospike [9]. It is not trivial to re- 
engineer a KV store to make it support a complex data 
model, whereas RDBMSs easily emulates KV stores 
without any changes. If an application needs an embed- 
ded DBMS, there are full-featured choices available to- 
day, including SQLite [71] and DuckDB [180]. Hence, 
a RDBMS may be a better choice, even for simple appli- 
cations, because they offer a path forward if the applica- 
tion’s complexity increases. 


One new architecture trend from the last 20 years is 
using embedded KV stores as the underlying storage 
manager for full-featured DBMSs. Prior to this, build- 
ing a new DBMS requires engineers to build a custom 
storage manager that is natively integrated in the DBMS. 
MySQL was the first DBMS to expose an API that al- 
lowed developers to replace its default KV storage man- 
ager. This API enabled Meta to build RocksDB to re- 
place InnoDB for its massive fleet of MySQL databases. 
Similarly, MongoDB discarded their ill-fated MMAP- 
based storage manager in favor of WiredTiger’s KV 
store in 2014 [120, 138]. Using an existing KV store 
allows developers to write a new DBMS in less time. 
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2.3. Document Databases 


The document data model represents a database as a col- 
lection of record objects. Each document contains a hier- 
archy of field/value pairs, where each field is identified 
by a name and a field’s value can be either a scalar type, 
an array of values, or another document. The following 
example in JSON is a customer document that contain 
a nested list of purchase order records with their corre- 
sponding order items. 


{ “name”: “First Last”, 
Moola a {| ab “sie a ais, Maho ses: [iso l) I, 
qe Wale lP a aya Make a eae) Ves Al 


Document data models have been an active field of 
effort for several decades. This has given rise to data 
formats like SGML [117] and XML [118]. Despite 
the buzz with XML databases in the late 1990s, we 
correctly predicted in 2005 they would not supplant 
RDBMSs [188]. JSON has since overtaken XML to 
become the standard for data exchange for web-based 
applications. JavaScript’s popularity with developers 
and the accompanying ubiquity of JSON led several 
companies to create document-oriented systems that na- 
tively stored JSON in the 2000s. 

The inability of OLTP RDBMSs to scale in the 2000s 
ushered in dozens of document DBMSs that marketed 
themselves using the catchphrase NoSQL [110]. There 
were two marketing messages for such systems that res- 
onated with developers. First, SQL and joins are slow, 
and one should use a “faster” lower-level, record-at-a- 
time interface. Second, ACID transactions are unneces- 
sary for modern applications, so the DBMS should only 
provide weaker notion of it (i.e., BASE [179]). 

Because of these two thrusts, NoSQL came to stand 
for a DBMS that stored records or documents as JSON, 
supported a lower-level API, and weak or non-existent 
transactions. There are dozens of such systems, of 
which MongoDB [41] is the most popular. 


Discussion: Document DBMSs are essentially the same 
as object-oriented DBMSs from the 1980s and XML 
DBMSs from the late 1990s. Proponents of document 
DBMSs make the same argument as their OO/KML 
predecessors: storing data as documents removes the 
impedance mismatch between how application OO code 
interacts with data and how relational databases store 
them. They also claim that denormalizing entries into 
nested structures is better for performance because it re- 
moves the need to dispatch multiple queries to retrieve 
data related to a given object (i.e., “N+1 problem” in 
ORMs). The problems with denormalization/prejoining 
is an old topic that dates back to the 1970s [116]: (1) 
if the join is not one-to-many, then there will be dupli- 
cated data, (2) prejoins are not necessarily faster than 
joins, and (3) there is no data independence. 
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Despite strong protestations that SQL was terrible, by 
the end of the 2010s, almost every NoSQL DBMS added 
a SQL interface. Notable examples include DynamoDB 
PartiQL [56], Cassandra CQL [15], Aerospike AQL [9], 
and Couchbase SQL++ [72]. The last holdout was Mon- 
goDB, but they added SQL for their Atlas service in 
2021 [42]. Instead of supporting the SQL standard for 
DDL and DML operations, NoSQL vendors claim that 
they support their own proprietary query language de- 
rived or inspired from SQL. For most applications, these 
distinctions are without merit. Any language differences 
between SQL and NoSQL derivatives are mostly due to 
JSON extensions and maintenance operations. 

Many of the remaining NoSQL DBMSs also added 
strongly consistent (ACID) transactions (see Sec. 3.4). 
As such, the NoSQL message has morphed from “Do 
not use SQL — it is too slow!” to “Not only SQL” (i.e., 
SQL is fine for some things). 

Adding SQL and ACID to a NoSQL DBMS lowers 
their intellectual distance from RDBMSs. The main 
differences between them seems to be JSON support 
and the fact that NoSQL vendors allow “schema later” 
databases. But the SQL standard added a JSON data 
type and operations in 2016 [165, 178]. And as RDBMSs 
continue to improve their “first five minutes” experience 
for developers, we believe that the two kinds of systems 
will soon be effectively identical. 

Higher level languages are almost universally pre- 
ferred to record-at-a-time notations as they require less 
code and provide greater data independence. Although 
we acknowledge that the first SQL optimizers were slow 
and ineffective, they have improved immensely in the 
last 50 years. But the optimizer remains the hardest part 
of building a DBMS. We suspect that this engineering 
burden was a contributing factor to why NoSQL systems 
originally chose to not support SQL. 


2.4 Column-Family Databases 


There is another category of NoSQL systems that uses 
a data model called column-family (aka wide-column). 
Despite its name, column-family is not a columnar data 
model. Instead, it is a reduction of the document data 
model that only supports one level of nesting instead 
of arbitrary nesting; it is relation-like, but each record 
can have optional attributes, and cells can contain an ar- 
ray of values. The following example shows a mapping 
from user identifier keys to JSON documents that con- 
tain each user’s varying profile information: 


User1000 > { “name”: “Alice”, 


“accounts”: [ 123, 456 J, 
“email”: "xxx@xxx.edu” } 
User1001 7 { “name”: “Bob”, 
“email”: [ “yyy@yyy.org”, “zzz@zzz.com” ] } 


The first column-family model DBMS was Google’s 
BigTable in 2004 [111]. Instead of adopting SQL and 
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emerging columnar storage, Google used this data model 
with procedural client APIs. Other systems adopted the 
column-family model in an attempt to copy Google’s be- 
spoke implementation. Most notable are Cassandra [14] 
and HBase [28]. They also copied BigTable’s limita- 
tions, including the lack of joins and secondary indexes. 


Discussion: All our comments in Sec. 2.3 about the 
document model are also applicable here. In the early 
2010s, Google built RDBMSs on top of BigTable, in- 
cluding MegaStore [99] and the first version of Span- 
ner. Since then, Google rewrote Spanner to remove 
the BigTable remnants [98], and it is now the primary 
database for many of its internal applications. Several 
NoSQL DBMSs deprecated their proprietary APIs in 
favor of SQL but still retain their non-relational archi- 
tectures. Cassandra replaced their Thrift-API with a 
SQL-like language called CQL [15], and HBase now 
recommends the Phoenix SQL-frontend [57]. Google 
still offers BigTable as a cloud service, but the column- 
family model is a singular outlier with the same disad- 
vantages as NoSQL DBMSs. 


2.5 Text Search Engines 


Text search engines have existed for a long time, begin- 
ning with the seminal SMART system in the 1960s [184]. 
SMART pioneered information retrieval and the vector 
space model, now nearly universal in modern search 
engines, by tokenizing documents into a “bag of words” 
and then building full-text indexes (aka inverted indexes) 
on those tokens to support queries on their contents. The 
system was also cognizant of noise words (e.g., “the”, 
“a”), synonyms (e.g., “The Big Apple” is a synonym for 
“New York City”), salient keywords, and distance (e.g., 
“drought” often appears close to “climate change’). 

The leading text search systems today include Elastic- 
search [23] and Solr [70], which both use Lucene [38] 
as their internal search library. These systems offer 
good support for storing and indexing text data but offer 
none-to-limited transaction capabilities. This limitation 
means that a DBMS has to recover from data corruption 
by rebuilding the document index from scratch, which 
results in significant downtime. 

All the leading RDBMSs support full-text search in- 
dexes, including Oracle [52], Microsoft SQL Server [52], 
MySQL [43], and PostgreSQL [62]. Their search fea- 
tures have improved recently and are generally on par 
with the special-purpose systems above. They also have 
the advantage of built-in transaction support. But their 
integration of search operations in SQL is often clunky 
and differs between DBMSs. 


Discussion: Text data is inherently unstructured, which 
means that there is no data model. Instead, a DBMS 
seeks to extract structure (i.e., meta-data, indexes) from 
text to avoid “needle in the haystack” sequential searches. 
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There are three ways to manage text data in application. 
First, one can run multiple systems, such as Elastic- 
search for text and a RDBMS for operational workloads. 
This approach allows one to run “best of breed” systems 
but requires additional ETL plumbing to push data from 
the operational DBMS to the text DBMS and to rewrite 
applications to route queries to the right DBMSs based 
on their needs. Alternatively, one can run a RDBMS 
with good text-search integration capabilities but with 
divergent APIs in SQL. This latter issue is often over- 
come by application frameworks that hide this complex- 
ity (e.g., Django Haystack [20]). The third option is 
a polystore system [187] that masks the system differ- 
ences via middleware that exposes a unified interface. 
Inverted index-centric search engines based on SMART 

are used for exact match searches. These methods have 
been supplanted in recent years by similarity search us- 
ing ML-generated embeddings (see Sec. 2.7). 


2.6 Array Databases 


There are many areas of computing where arrays are an 
obvious data representation. We use the term “array” to 
mean all variants of them [182]: vectors (one dimen- 
sion — see Sec. 2.7), matrices (two dimensions), and 
tensors (three or more dimensions). For example, sci- 
entific surveys for geographic regions usually represent 
data as a multi-dimensional array that stores sensor mea- 
surements using location/time-based coordinates: 


(latitude, longitude, time, [vector-of-values] ) 


Several other data sets look like this, including ge- 
nomic sequencing and computational fluid dynamics. 
Arrays are also the core of most ML data sets. 

Although array-based programming languages have 
existed since the 1960s (APL [142]), the initial work 
on array DBMSs began in the 1980s. PICDMS is con- 
sidered to be the first DBMS implementation using the 
array data model [114]. The two oldest array DBMSs 
still being developed today are Rasdaman [66, 103] and 
kdb+ [34]. Newer array DBMSs include SciDB [54, 
191] and TileDB [76]. HDF5 [29] and NetCDF [46] are 
popular array file formats for scientific data. 

There are several system challenges with storing and 
querying real-world array data sets. Foremost is that ar- 
ray data does not always align to a regular integer grid; 
for example, geospatial data is often split into irregular 
shapes. An application can map such grids to integer 
coordinates via metadata describing this mapping [166]. 
Hence, most applications maintain array and non-array 
data together in a single database. 

Unlike row- or column-based DBMSs, querying array 
data in arbitrary dimensions presents unique challenges. 
The difficulty arises from storing multi-dimensional ar- 
ray data on a linear physical storage medium like a disk. 
To overcome these challenges, array DBMSs must em- 
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ploy indexing and data structures to support efficient 
traversal across array dimensions. 


Discussion: Array DBMSs are a niche market that has 
only seen adoption in specific verticals (we discuss vec- 
tor DBMSs next). For example, they have considerable 
traction in the genomics space. HDF5 is popular for 
satellite imagery and other gridded scientific data. But 
business applications rarely use dedicated array DBMSs, 
which is necessary for any product to survive. No ma- 
jor cloud provider offers a hosted array DBMS service, 
meaning they do not see a sizable market. 

The challenge that array DBMS vendors have al- 
ways faced is that the SQL includes support for or- 
dered arrays as first-class data types (despite this be- 
ing against the original RM proposal [115]). The first 
proposal to extend the unordered set-based RM with 
ordered rasters was in 1993 [155]. An early exam- 
ple of this was Illustra’s temporal (one-dimensional) 
data plugin [31]. SQL:1999 introduced limited sup- 
port for single-dimension, fixed-length array data types. 
SQL:2003 expanded to support nested arrays without a 
predefined maximum cardinality. Later entrants include 
Oracle Georaster [4] and Teradata [73]. Data cubes are 
special-purpose arrays [135], but columnar RDBMSs 
have eclipsed them for OLAP workloads because of 
their better flexibility and lower engineering costs [113]. 

More recently, the SQL:2023 standard includes sup- 
port for true multi-dimensional arrays (SQL/MDA) that 
is heavily inspired by Rasdaman’s RQL [166]. This 
update allows SQL to represent arrays with arbitrary 
dimensions using integer-based coordinates. In effect, 
this allows data cubes to exist in a SQL framework, but 
columnar DBMSs now dominate this market. 


2.7 Vector Databases 


Similar to how the column-family model is a reduction 
of the document model, the vector data model simplifies 
the array data model to one-dimensional rasters. Given 
that vector DBMSs are attracting the most attention 
right now from developers and investors (similar to the 
NoSQL fad), it is necessary to discuss them separately. 
The reason for this interest is because developers use 
them to store single-dimension embeddings generated 
from AI tools. These tools use learned transformations 
to convert a record’s data (e.g., text, image) into a vec- 
tor representing its latent semantics. For example, one 
could convert each Wikipedia article into an embedding 
using Google BERT and store them in a vector database 
along with additional article meta-data: 


(title, date, author, [embedding-vector] ) 
The size of these embedding vectors range from 100s 
of dimensions for simple transformers to 1000s for high- 


end models; these sizes will obviously grow over time 
with the development of more sophisticated models. 
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The key difference between vector and array DBMSs 
is their query patterns. The former are designed for 
similarity searches that find records whose vectors have 
the shortest distance to a given input vector in a high- 
dimensional space. The input vector is another embed- 
ding generated with the same transformer used to popu- 
late the database. Unlike array DBMSs, applications do 
not use vector DBMSs to search for matches at an off- 
set in a vector nor extract slices across multiple vectors. 
Instead, the dominant use case is this similarity search. 

To avoid brute force scans for finding the most sim- 
ilar records, vector DBMSs build indexes to accelerate 
approximate nearest neighbor (ANN) searches. Appli- 
cations issue queries with predicates on both the em- 
bedding index and non-embedding attributes (i.e., meta- 
data). The DBMS then chooses whether to use the non- 
embedding predicate on records before (pre-filter) or af- 
ter (post-filter) the vector search. 

There are dozens of new DBMSs in this emerging 
category, with Pinecone [58], Milvus [40], and Weav- 
iate [84] as the leading systems. Text search engines, 
including Elasticsearch [23], Solr [70], and Vespa [79], 
expanded their APIs to support vector search. Other 
DBMSs rebranded themselves as vector databases to 
jump on the bandwagon, such as Kdb+ [34]. 

One compelling feature of vector DBMSs is that 
they provide better integration with AI tools (e.g., Chat- 
GPT [16], LangChain [36]) than RDBMSs. These sys- 
tems natively support transforming a record’s data into 
an embedding upon insertion using these tools and then 
uses the same transformation to convert a query’s in- 
put arguments into an embedding to perform the ANN 
search; other DBMSs require the application to perform 
these transformations outside of the database. 


Discussion: Unlike array DBMSs that require a cus- 
tomized storage manager and execution engine to sup- 
port efficient operations on multi-dimensional data, vec- 
tor DBMSs are essentially document-oriented DBMSs 
with specialized ANN indexes. Such indexes are a fea- 
ture, not the foundation of a new system architecture. 

After LLMs became “mainstream” with ChatGPT in 
late 2022, it took less than one year for several RDBMSs 
to add their own vector search extensions. In 2023, 
many of the major RDBMSs added vector indexes, in- 
cluding Oracle [7], SingleStore [137], Rockset [8], and 
Clickhouse [157]. Contrast this with JSON support in 
RDBMSs. NoSQL systems like MongoDB and CouchDB 
became popular in the late 2000s and it took several 
years for RDBMSs to add support for it. 

There are two likely explanations for the quick pro- 
liferation of vector indexes. The first is that similarity 
search via embeddings is such a compelling use case 
that every DBMS vendor rushed out their version and 
announced it immediately. The second is that the en- 
gineering effort to introduce a new index data structure 
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is small enough that it did not take that much work for 
the DBMS vendors to add vector search. Most of them 
did not write their vector index from scratch and instead 
integrated an open-source library (e.g., pg Vector [145], 
DiskANN [19], FAISS [24]). 

We anticipate that vector DBMSs will undergo the 
same evolution as document DBMSs by adding features 
to become more relational-like (e.g., SQL, transactions, 
extensibility). Meanwhile, relational incumbents will 
have added vector indexes to their already long list of 
features and moved on to the next emerging trend. 


2.8 Graph Databases 


There has been a lot of academic and industry interest 
in the last decade in graph databases [183]. Many appli- 
cations use knowledge graphs to model semi-structured 
information. Social media applications inherently con- 
tain graph-oriented relationships (“likes”, “friend-of’’). 
Relational design tools provide users with an entity- 
relationship (ER) model of their database. An ER dia- 
gram is a graph; thus, this paradigm has clear use cases. 

The two most prevalent approaches to represent graphs 
are (1) the resource description framework (RDF) and 
(2) property graphs [126]. With property graphs, the 
DBMS maintains a directed multi-graph structure that 
supports key/value labels for nodes and edges. RDF 
databases (aka triplestores) only model a directed graph 
with labeled edges. Since property graphs are more com- 
mon and are a superset of RDF, we will only discuss 
them. We consider two use cases for graph DBMSs and 
discuss the problems that will limit their adoption. 

The first category of systems are for operational / 
OLTP workloads: an application, for example, adds a 
friend link in the database by updating a single record, 
presumably in a transactional manner. Neo4j [44] is the 
most popular graph DBMS for OLTP applications. It 
supports edges using pointers (as in CODASYL) but it 
does not cluster nodes with their “parent” or “offspring”’. 
Such an architecture is advantageous for traversing long 
edge chains since it will do pointer chasing, whereas a 
RDBMS has to do this via joins. But their potential mar- 
ket success comes down to whether there are enough 
“long chain” scenarios that merit forgoing a RDBMS. 

The second use case is analytics, which seeks to de- 
rive information from the graph. An example of this 
scenario is finding which user has the most friends un- 
der 30 years old. Notable entries like Tigergraph [74] 
and JanusGraph [32] focus on query languages and 
storage on a graph DBMS. Other systems, such as Gi- 
raph [26] and Turi [78] (formerly Graphlab [27]) pro- 
vide a computing fabric to support parallel execution of 
graph-oriented programs, typically written by a user. 

Unlike queries in relational analytics that are charac- 
terized by chains of joins, queries for graph analytics 
contain operations like shortest path, cut set, or clique 
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determination. Algorithm choice and data representa- 
tion will determine a DBMS’s performance. This argues 
for a computing fabric that allows developers to write 
their own algorithms using an abstraction that hides 
the underlying system topology. However, previous 
research shows that distributed algorithms rarely out- 
perform single-node implementations because of com- 
munication costs [160]. A better strategy is to compress 
a graph into a space-efficient data structure that fits in 
memory on a single node and then run the query against 
this data structure. All but the largest graph databases 
are probably best handled this way. 


Discussion: Regardless of whether a graph DBMS tar- 
gets OLTP or OLAP workloads, the key challenge these 
systems have to overcome is that it is possible to simu- 
late a graph as a collection of tables: 


Node (node_id, node_data) 
Edge (node_id_1, node_id_2, edge_data) 


This means that RDBMSs are always an option to sup- 
port graphs. But “vanilla” SQL is not expressive enough 
for graph queries and thus require multiple client-server 
roundtrips for traversal operations. 

Some RDBMSs, including MSSQL [3] and Oracle [50], 
provide built-in SQL extensions that make storing and 
querying graph data easier. Other DBMSs use a transla- 
tion layer on top of relations to support graph-oriented 
APIs. Amazon Neptune [45] is a graph-oriented veneer 
on top of Aurora MySQL. Apache AGE provides an 
OpenCypher interface on top of PostgreSQL [10]. 

More recently, SQL:2023 introduced property graph 
queries (SQL/PGQ) for defining and traversing graphs 
in a RDBMS [196]. The syntax builds on existing lan- 
guages (e.g., Neo4j’s Cypher [49], Oracle’s PGQL [51], 
and TigerGraph’s GSQL [75]), and shares aspects of the 
emerging GQL standard [126]. Thus, SQL/PGQ further 
narrows the functionality difference between RDBMSs 
and native graph DBMSs. 

The question is whether graph DBMS vendors can 
make their specialized systems fast enough to over- 
come the above disadvantages. There have been several 
performance studies showing that graph simulation on 
RDBMSs outperform graph DBMSs [130, 143]. More 
recent work showed how SQL/PGQ in DuckDB outper- 
forms a leading graph DBMS by up to 10x [196]. This 
trend will continue with further improvements in worst- 
case optimal joins [132, 168] and factorized execution 
algorithms [100] for graph queries in RDBMSs. 


2.9 Summary 


A reasonable conclusion from the above section is that 
non-SQL, non-relational systems are either a niche mar- 
ket or are fast becoming SQL/RM systems. Specifically: 


e MapReduce Systems: They died years ago and are, 
at best, a legacy technology at present. 
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e Key-value Stores: Many have either matured into 

RM systems or are only used for specific problems. 

These can generally be equaled or beaten by modern 

high-performance RDBMSs. 

Document Databases: Such NoSQL systems are on 

a collision course with RDBMSs. The differences 

between the two kinds of systems have diminished 

over time and should become nearly indistinguish- 
able in the future. 

e Column-Family Systems: These remain a niche 

market. Without Google, this paper would not be 

talking about this category. 

Text Search Engines: These systems are used for 

text fields in a polystore architecture. It would be 

valuable if RDBMSs had a better story for search so 
these would not have to be a separate product. 

Array Databases: Scientific applications will con- 

tinue to ignore RDBMSs in favor of bespoke array 

systems. They may become more important because 

RDBMSs cannot efficiently store and analyze arrays 

despite new SQL/MDA enhancements. 

Vector Databases: They are single-purpose DBMSs 

with indexes to accelerate nearest-neighbor search. 

RM DBMSs should soon provide native support for 

these data structures and search methods using their 

extendable type system that will render such special- 
ized databases unnecessary. 

e Graph Databases: OLTP graph applications will 
be largely served by RDBMSs. In addition, ana- 
lytic graph applications have unique requirements 
that are best done in main memory with specialized 
data structures. RDBMSs will provide graph-centric 
APIs on top of SQL or via extensions. We do not ex- 
pect specialized graph DBMSs to be a large market. 


Beyond the above, there are also proposals to rebrand 
previous data models as something novel. For example, 
graph-relational [158] is the same as the semantic data 
model [202]. Likewise, document-relational is the doc- 
ument model with foreign keys [199]. Others provide a 
non-SQL veneer over a RDBMS (e.g., PRQL [64], Mal- 
loy [39]). Although these languages deal with some of 
SQL’s shortcomings, they are not compelling enough to 
overcome its entrenched userbase and ecosystem. 


3 System Architectures 


There have been major new ideas in DBMS architec- 
tures put forward in the last two decades that reflect- 
ing changing application and hardware characteristics. 
These ideas range from terrific to questionable, and we 
discuss them in turn. 


3.1 Columnar Systems 


To understand the appeal of columnar DBMSs, we need 
to explain the origins of the data warehouse (OLAP) 
market. Beginning in the mid-1990s, enterprises started 
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collecting their customer facing (usually sales) data. 
Brick-and-mortar retailers (e.g., Walmart) were at fore- 
front of constructing historical sales databases. These 
companies generally found that a sales data warehouse 
would pay for itself in better stock ordering and rota- 
tion decisions within six months. Such customer facing 
databases are now omnipresent in enterprises. 
Data warehouse applications have common proper- 
ties that are distinct from OLTP workloads: 
1. They are historical in nature (i.e., they are loaded 
periodically and then are read-only). 
2. Organizations retain everything as long as they can 
afford the storage — think terabytes to petabytes. 
3. Queries typically only access a small subset of at- 
tributes from tables and are ad-hoc in nature. 
Ralph Kimball was an early proponent of star schema 
data modelling for data warehouses [148, 149]. The idea 
was to construct a fact table that held item-level trans- 
actional data. The classic example is a fact table that 
contains a record for every item purchased in a retail en- 
terprise. Then, one surrounds the fact table with dimen- 
sion tables that contain common information factored 
out from the fact table to save space. Again, in a retail 
setting, these dimension tables would include informa- 
tion about customers, products, stores, and time. 
Organizing the DBMS’s storage by columns instead 
of rows has several benefits [87]. First, compressing 
columnar data is more effective than row-based data be- 
cause there is a single value type in a data block of- 
ten many repeated bytes. Second, a Volcano-style en- 
gine executes operators once per row. In contrast, a 
column-oriented engine has an inner loop that processes 
a whole column using vectorized instructions [106, 147]. 
Lastly, row stores have a large header for each record 
(e.g., 20 bytes) to track nulls and versioning meta-data, 
whereas column stores have minimal storage overhead 
per record. 


Discussion: Over the last two decades, all vendors ac- 
tive in the data warehouse market have converted their 
offerings from a row store to a column store. This tran- 
sition brought about significant changes in the design 
of DBMSs. In addition, several new vendors have en- 
tered the market in the last two decades with column 
store offerings, for example Amazon’s Redshift [94] and 
Google’s BigQuery [162] along with offerings from in- 
dependent companies (e.g., Snowflake [121]). 

In summary, column stores are new DBMS implemen- 
tations with specialized optimizers, executors, and stor- 
age formats. They have taken over the data warehouse 
marketplace because of their superior performance. 


3.2 Cloud Databases 


The rise of cloud platforms in the late 2000s has also 
greatly affected the implementation (and sales model) of 
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DBMSs. Initial cloud DBMS offerings repackaged on- 
prem systems into managed VMs with direct-attached 
storage. But over the last 20 years, networking band- 
width has increased much faster than disk bandwidth, 
making network attached storage (NAS) attractive as an 
alternative to attached storage. This has caused a pro- 
found rethinking of DBMS architectures for the cloud. 
All major cloud vendors offer NAS via object stores 
(e.g., Amazon $3) with some DBMS functionality (e.g., 
replication, filtering). Beyond better economics com- 
pared to direct-attached storage, object stores have sev- 
eral advantages that compensate for the cost of the 
added network link. First, because the compute nodes 
are disconnected from the storage nodes, a system can 
provide per-query elasticity; the DBMS can add new 
compute nodes dynamically without having to reshuffle 
data. It also allows the DBMS to use different hard- 
ware for its storage nodes than compute nodes. Second, 
the system can reassign compute nodes to other tasks 
if a DBMS is underutilized. On the other hand, in a 
shared-nothing DBMS, a node must always be online to 
handle incoming query requests. Lastly, pushing down 
computation into the storage nodes is possible (and gen- 
erally advantageous). This execution strategy is known 
as “pushing the query to the data” versus “pulling the 
data to the query” and is well understood in DBMSs. 
Generally, the first two ideas are called “serverless 
computing”, and was introduced for cloud-native DBMSs 
by Snowflake [121]. Other vendors have moved or 
are in the process of moving to a serverless environ- 
ment for their cloud offerings. Effective utilization of 
this model requires a hosted multi-node environment in 
which multiple DBMS customers are grouped onto the 
same node(s) with a multi-tenant execution scheme. 


Discussion: The advent of cloud databases is another 
example of “what goes around comes around”. Multi- 
node shared-disk DBMSs are an old idea that histori- 
cally tended not to work out well. However, it is back in 
vogue with technology change (faster networking) and 
moving to the cloud. In addition, time-sharing services 
were popular in the 1970s when computers were big and 
expensive. Cloud platforms are big time-sharing ser- 
vices, so the concept is back after a few decades. Since 
enterprises are moving everything possible to the cloud, 
we expect this shared-disk to dominate DBMS architec- 
tures. Hence, we do not foresee shared-nothing architec- 
tures resurfacing in the future. 

The cloud has profoundly impacted DBMSs, causing 
them to be completely re-architected. The movement 
of computing from on-prem to the cloud generates a 
once-in-a-lifetime opportunity for enterprises to refac- 
tor codebases and remove bad historical technology de- 
cisions. A cloud environment also provides several ben- 
efits to vendors that are not possible with on-prem de- 
ployments. Foremost is that vendors can track usage 
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trends for all their customers: they can monitor unex- 
pected behavior, performance degradations, and usage 
patterns. Moreover, they can push incremental updates 
and code patches without disrupting service. 

From a business perspective, open-source DBMSs 
face the danger of becoming too popular and being mon- 
etized by the major cloud providers. The public spats 
between Amazon and ISVs like MongoDB [153] and 
Elasticsearch [101] are notable examples. 


3.3. Data Lakes / Lakehouses 


Another trend that the cloud platforms fomented is the 
movement away from monolithic, dedicated data ware- 
houses for OLAP workloads and towards data lakes 
backed by object stores. With legacy data warehouses, 
organizations load data into the DBMS, which the sys- 
tem stashes in managed storage with proprietary for- 
mats. Vendors viewed their DBMSs as the “gatekeepers” 
for all things related to data in an organization. However, 
this has not been the model of many organizations, es- 
pecially technology companies, for the last decade. 

With a data lake architecture, applications upload files 
to a distributed object store, bypassing the traditional 
route through the DBMS [167]. Users then execute 
queries and processing pipelines on these accumulated 
files using a lakehouse (a portmanteau of data ware- 
house and data lake) execution engine [93]. These lake- 
house systems provide a unified infrastructure support- 
ing SQL and non-SQL workloads. The latter is crucial 
as the last decade has shown that data scientists and ML 
practitioners typically use Python-based notebooks that 
use Panda’s DataFrame API [159] to access data instead 
of SQL. Several projects leverage DBMS methods to 
optimize DataFrame processing, including Dask [181], 
Polars [61], Modin [177], and Bodo [198]. 

Instead of using DBMS-specific proprietary file for- 
mats or inefficient text-based files (e.g., CSV, JSON), 
applications write data to data lakes using open-source, 
disk-resident file formats [203]. The two most pop- 
ular formats are Twitter/Cloudera’s Parquet [55] and 
Meta’s ORC [53, 140]. Both of them borrow tech- 
niques from earlier columnar storage research, such as 
PAX [90], compression [87], and nested-data (JSON) 
shredding [121, 161]. Apache Arrow [11] is a similar 
binary format for exchanging in-memory data between 
systems. Open-source libraries for reading/writing these 
formats allow disparate applications to create data files 
that other systems then parse and consume, thereby en- 
hancing data sharing across services and business units. 


Discussion: Data lakes are the successor to “Big Data” 
movement from the early 2010s, partly led by the pop- 
ularity of MR systems (Sec. 2.1) and column stores 
(Sec. 3.1). At first glance, a data lake seems like a terri- 
ble idea for an organization: allowing any application to 
write arbitrary files into a centralized repository without 
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any governance is a recipe for integrity, discovery, and 
versioning problems [167]. Lakehouses provide much- 
needed control over these environments to help mitigate 
many problems with meta-data, caching, and indexing 
services [93]. Additional middleware that tracks new 
data and supports transactional updates, such as Delta 
Lake [92], Iceberg [6], and Hudi [5], make lakehouses 
look more like a traditional data warehouse. 

Data lakes introduce new challenges to query opti- 
mization. DBMSs have always struggled with acquir- 
ing precise statistics on data, leading to poor query 
plan choices [154]. However, a data lake system may 
completely lack statistics on newly ingested data files. 
Consequently, incorporating adaptive query processing 
strategies is imperative in the cloud to enable a DBMS to 
dynamically modify query plans during execution based 
on observed data characteristics [97, 105, 163]. 

All the major cloud vendors now offer some variation 
of a managed data lake service. Since data lake sys- 
tems backed by object stores are much cheaper per giga- 
byte than proprietary data warehouses, the legacy OLAP 
vendors (e.g., Teradata, Vertica) have extended their 
DBMSs to support reading data from object stores in re- 
sponse to this pricing pressure. Several independent sys- 
tems are also in this space, including Databricks [105], 
Dremio [21], PrestoDB [63], and Trino [77]. 


3.4 NewSQL Systems 


In the late 2000s, there were multiple distributed NoSQL 
DBMSs available designed to scale horizontally to sup- 
port online applications with large number of concurrent 
users [110]. However, many organizations could not use 
these NoSQL systems because their applications could 
not give up strong transactional requirements. But the 
existing RDBMSs (especially open-source ones) were 
not able to (natively) scale across multiple machines. In 
response, NewSQL systems arrived in the early 2010s 
seeking to provide the scalability of NoSQL systems for 
OLTP workloads while still supporting SQL [95, 171]. 
In other words, these new systems sought to achieve the 
same scalability of NoSQL DBMSs from the 2000s but 
still keep the RM and ACID transactions of the legacy 
DBMSs from the 1990s. 

There were two main groups of NewSQL systems. 
The first was in-memory DBMSs, including H-Store [144, 
189] (commercialized as VoltDB [83]), SingleStore [69], 
Microsoft Hekaton [128], and HyPer [146]. Other start- 
up offerings included disk-oriented, distributed DBMSs 
like NuoDB [47] and Clustrix [17]. 


Discussion: There has yet to be a dramatic uptake in 
NewSQL DBMS adoption [96]. The reason for this lack- 
luster interest is that existing DBMSs were good enough 
for the time, which means organizations are unwilling to 
take on the costs and risk of migrating existing applica- 
tions to newer technologies. Companies are more risk- 
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averse with changing OLTP DBMSs than with OLAP. 
If an OLTP DBMS fails, companies cannot execute the 
transactions they need to generate revenue. In contrast, 
an OLAP DBMS failure could be limited to temporarily 
inconveniencing an analyst or data scientist. 

There were other restrictions in NewSQL DBMSs, 
such as only supporting a subset of standard SQL or 
bad performance on multi-node transactions. Some 
NewSQL products, like Microsoft’s Hekaton, were only 
available as an extension to a legacy DBMS, requiring 
the faster engine to use the slower DBMS’s interfaces. 

NewSQL vendors also incorrectly anticipated that in- 
memory DBMS adoption would be larger in the last 
decade. Flash vendors drove down costs while improv- 
ing storage densities, bandwidth, and latencies. Higher 
DRAM costs and the collapse of persistent memory 
(e.g., Intel Optane) means that SSDs will remain domi- 
nant for OLTP DBMSs. 

The aftermath of NewSQL is a new crop of distributed, 
transactional SQL RDBMSs. These include TiDB [141], 
CockroachDB [195], PlanetScale [60] (based on the 
Vitess sharding middleware [80]), and YugabyteDB [86]. 
The major NoSQL vendors also added transactions to 
their systems in the last decade despite previously strong 
claims that they were unnecessary. Notable DBMSs that 
made the shift include MongoDB, Cassandra, and Dy- 
namoDB. This is of course due to customer requests 
that transactions are in fact necessary. Google said this 
cogently when they discarded eventual consistency in 
favor of real transactions with Spanner in 2012 [119]. 


3.5 Hardware Accelerators 


There has been a hunt for a cost-effective hardware ac- 
celerator for DBMSs for the last 50 years. The promise 
is obvious: specialized hardware designed for a DBMS 
should easily outperform a conventional CPU. 

In the 1980s, vendors fabricated custom hardware to 
accelerate DBMSs and marketed them as database ma- 
chines [107]. Britton-Lee released the first commercial 
accelerator product IDM/S500) in 1981 [192] that con- 
tained a conventional CPU with a hardware accelerator 
that offloaded portions of a query’s execution. This ac- 
celerator targeted a small subset of the execution path, 
and was not cost-effective. Teradata introduced its own 
database machine that provided network hardware for 
sorting in-flight tuples (Y-net [1]), but it was dropped 
for a software-only solution [85]. All other custom hard- 
ware DBMS acceleration during the 1980s failed. 

Instead of building custom hardware for DBMSs, the 
last 20 years have been about using commodity hard- 
ware (FPGAs, GPUs) to accelerate queries. This is an 
enticing idea: a vendor can get the benefits of a DBMS 
accelerator without the cost of fabricating the hardware. 

Netezza was one of the first FPGA-based DBMSs 
that started in the late 1990s as a fork of PostgreSQL. 
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It used an FPGA to accelerate searches on disk-resident 
pages, but originally could not search in-memory pages. 
Netezza corrected this limitation in a later version [2]. 
Swarm64 attempted to sell a FPGA accelerator for Post- 
greSQL but switched to a software-only architecture 
without the FPGA before they were acquired [91]. Vitesse’s 
Deepgreen DB [81] is the only remaining FPGA-enhanced 
DBMS available from an ISV. 


There is more activity in the GPU-accelerated DBMS 
market. Notable GPU DBMSs include Kinetica [35], 
Sqream [35], Brytlyt [13], and HeavyDB [48]. If data 
does not fit in GPU memory, then query execution is 
bottlenecked on loading data into the device, thereby 
rendering the hardware’s parallelization benefits moot. 


Discussion: There are several conclusions that we can 
draw from the above analysis. First, these systems are 
all focused on the OLAP market and only for RDBMSs; 
there are essentially no data model implications to the 
discussion in this section. Also, OLAP workloads will 
continue to move aggressively to the cloud, but special- 
purpose hardware is not likely to find acceptance unless 
it is built by the cloud vendor. 

Creating custom hardware just for a DBMS is not 
cost-effective for most companies. Commodity hard- 
ware avoids this problem but there is still the challenge 
of integrating the hardware into a DBMS. The reason 
why there are more GPU DBMSs than FPGA systems 
is because there are existing support libraries available 
for GPUs (e.g., Nvidia CUDA [169]). But cloud CPU- 
based compute resources are incredibly cheap due to 
economies of scale. The success of any accelerator is 
likely to be limited to on-prem databases, but this mar- 
ket is not growing at the same rate as cloud databases. 


Even if one could get an accelerator to market that 
showed orders of magnitude improvement over existing 
technologies, that only solves half the problem needed 
for adoption and success. A hardware-only company 
must find somebody to add support for its accelerator 
in a DBMS. If the accelerator is an optional add-on to 
the DBMS, then adoption will be low and thus a DBMS 
vendor will not want to spend engineering time on sup- 
porting it. If the accelerator is a critical component of 
the DBMS, then no vendor would outsource the devel- 
opment of such an important part to an outside vendor. 


The only place that custom hardware accelerators will 
succeed is for the large cloud vendors. They can justify 
the $50-100m R&D cost of custom hardware at their 
massive scale. They also control the entire stack (hard- 
ware and software) and can integrate their hardware at 
critical locations. Amazon did this already with their 
Redshift AQUA accelerators [102]. Google BigQuery 
has custom components for in-memory shuffles [89]. 

In spite of the long odds, we predict that there will be 
many attempts in this space over the next two decades. 
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3.6 Blockchain Databases 


As of this writing, a waning database technology fad 
is blockchains. These are decentralized log-structured 
databases (i.e., ledger) that maintain incremental check- 
sums using some variation of Merkle trees. These in- 
cremental checksums are how a blockchain ensures that 
the database’s log records are immutable: applications 
use these checksums to verify that previous database up- 
dates have not been altered. 

The ideal use case for blockchain databases is peer-to- 
peer applications where one cannot trust anybody. There 
is no centralized authority that controls the ordering of 
updates to the database. Thus, blockchain implementa- 
tions use a BFT commit protocol to determine which 
transaction to apply to the database next. 

At the present time, cryptocurrencies (Bitcoin) are 
the only use case for blockchains. In addition, there 
have been attempts to build a usable DBMS on top of 
blockchains, notably Fluree [25], BigChainDB [12], and 
ResilientDB [136]. These vendors (incorrectly) promote 
the blockchain as providing better security and auditabil- 
ity that are not possible in previous DBMSs. 


Discussion: We are required to place trust in several en- 
tities in today’s society. When one sells a house, they 
trust the title company to manage the transaction. The 
only applications without real-world trust are dark web 
interactions (e.g., money laundering). Legitimate busi- 
nesses are unwilling to pay the performance price (about 
five orders of magnitude) to use a blockchain DBMS. If 
organizations trust each other, they can run a shared dis- 
tributed DBMS more efficiently without wasting time 
with blockchains. To the best of our knowledge, all the 
major cryptocurrency exchanges run their businesses off 
traditional RDBMSs and not blockchain systems. 

Blockchain proponents make additional meaningless 
claims of achieving data resiliency through replication 
in a peer-to-peer environment. No sensible company 
would rely on random participants on the Internet as the 
backup solution for mission-critical databases. 

There is possibly a (small) market for private block- 
chain DBMSs. Amazon’s Quantum Ledger Database 
(QLDB) released in 2018 [65] provides the same im- 
mutable and verifiable update guarantees as a blockchain, 
but it is not decentralized (i.e., no BFT commit protocol). 
Amazon built QLDB after finding no compelling use 
case for a fully decentralized blockchain DBMS [108]. 


3.7 Summary 


The key takeaways from the major technological thrusts 
in database systems are as follows: 


e Columnar Systems: The change to columnar stor- 
age revolutionized OLAP DBMS architectures. 

e Cloud Databases: The cloud has upended the con- 
ventional wisdom on how to build scalable DBMSs. 
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Except for embedded DBMSs, any product not start- 

ing with a cloud offering will likely fail. 

Data Lakes / Lakehouses: Cloud-based object stor- 

age using open-source formats will be the OLAP 

DBMS archetype for the next ten years. 

e NewSQL Systems: They leverage new ideas but 
have yet to have the same impact as columnar and 
cloud DBMSs. It has led to new distributed DBMSs 
that support stronger ACID semantics as a counter 
to NoSQL’s weaker BASE guarantees. 

e Hardware Accelerators: We do not see a use case 

for specialized hardware outside of the major cloud 

vendors, though start-ups will continue to try. 

Blockchain Databases: An inefficient technology 

looking for an application. History has shown this is 

the wrong way to approach systems development. 


4 Parting Comments 


Our analysis of the last two decades in databases has sev- 
eral takeaways. Unfortunately, some of these are repeats 
of the warnings from the 2005 paper. 


Never underestimate the value of good marketing for 
bad products. The database market is highly competi- 
tive and lucrative. This competition drives vendors to 
claim that their new technologies will solve all sorts of 
problems and change developers’ lives for the better. Ev- 
ery developer has struggled with databases before, so 
they are especially amenable to such marketing. Infe- 
rior DBMS products have succeeded via strong market- 
ing despite the existence of better options available at 
the time: Oracle did this in the 1980s, MySQL did this 
in the 2000s, and MongoDB did this in the 2010s. These 
systems got enough traction early on to buy them time 
to fix the engineering debt they accumulated earlier. 


Beware of DBMSs from large non-DBMS vendors. 
One interesting aspect in the last ten years of databases 
is the trend of tech companies building DBMSs in-house 
that they then spin out as open-source projects. All these 
systems started life as purpose-built applications for a 
tech company. The company then releases the DBMS 
as an open-source project (often pushed to the Apache 
Foundation for stewardship) in hopes to achieve “free” 
development from external users. 

Some times they come from large companies that can 
afford to allocate resources to developing new systems. 
Notable examples include Meta (Hive [197], Presto [63], 
Cassandra [14], RocksDB [68]) and LinkedIn (Kafka [33], 
Pinot [59], Voldemort [82]). Other systems are from 
start-ups building a data-intensive product where they 
felt the need to also build a DBMS. The most successful 
examples are 10gen (MongoDB) and PowerSet (HBase), 
but there also many failed endeavors. 

This trend to avoid “not invented here” software is 
partly because many companies’ promotion path favors 
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engineers who make new internal systems, even if ex- 
isting tools are sufficient. But this perversion led many 
teams without DBMS engineering experience to under- 
take building a new system. One should be wary of 
such systems when a company first open-sources them, 
as they are almost always immature technologies. 


Do not ignore the out-of-box experience. One of the 
salient selling points of many non-relational DBMSs is 
a better “out-of-box” experience than RDBMSs. Most 
SQL systems require one first to create a database and 
then define their tables before they can load data. This 
is why data scientists use Python notebooks to analyze 
data files quickly. Every DBMS should, therefore, make 
it easy to perform in situ processing of local and cloud- 
storage files. DuckDB’s rising popularity is partly due 
to its ability to do this well. 

Vendors should also consider additional challenges 
that customers will inevitably face with databases, in- 
cluding physical design, knob tuning, schema design, 
and query tuning. There is a crucial need for what one 
of us calls “self-driving” DBMSs [173]. 


Developers need to query their database directly. 
Most OLTP applications created in the last 20 years pri- 
marily interact with databases via an abstraction layer, 
such as an endpoint API (e.g., REST, GraphQL) or 
an object-relational mapper (ORM) library. Such lay- 
ers translate an application’s high-level requests into 
database queries. ORMs also automatically handle main- 
tenance tasks, such as schema migrations. One could 
argue that since OLTP developers never write raw SQL 
in their applications, it does not matter what data model 
their DBMS uses as these layers hide it. 

ORMs are a vital tool for rapid prototyping. But they 
often sacrifice the ability to push logic into the DBMS 
in exchange for interoperability with multiple DBMSs. 
Developers fall back to writing explicit database queries 
to override the poor auto-generated queries. This is why 
using a RDBMS that supports SQL is the better choice. 


The impact of AI/ML on DBMSs will be significant. 
How DBMSs should interact with modern AI/ML tools 
has recently become a crucial question, especially with 
the advent of LLMs (e.g., ChatGPT). Although this field 
is moving rapidly, we offer a few initial comments. 
There is a resurgence in using natural languages (NLs) 
to query databases due to advancements in LLMs at 
converting NL to query code (e.g., SQL) [133]. Some 
have even suggested that such AI-powered query inter- 
faces will render SQL obsolete. NL interfaces are an 
old research topic that dates back to the 1970s [139], 
but which historically has poor outcomes and thus little 
widespread use [88]. We acknowledge LLMs have im- 
pressive results for this task but caution those who think 
NL will replace SQL. Nobody will write OLTP appli- 
cations using an NL, as most generate queries using 
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ORMs. For OLAP databases, NL could prove helpful 
in constructing the initial queries for exploratory anal- 
ysis. However, these queries should be exposed to a 
dashboard-like refinement tool since English and other 
NLs are rife with ambiguities and impreciseness. 

There is a reluctance to depend on current LLM tech- 
nology for decision-making inside the enterprise, espe- 
cially with financial data. The biggest issue is that the 
output of an LLM is not explainable to a human. Sec- 
ond, LLM systems require more training data than “tra- 
ditional” ML systems (e.g., random forests, Bayesian 
models). Companies generally cannot outsource the cre- 
ation of training data for these models to unskilled peo- 
ple. For these reasons, the uptake of LLMs for enter- 
prise data will be cautiously slow. 

Lastly, there is a considerable amount of recent re- 
search on using AI/ML to optimize the DBMSs [174]. 
Examples include ML-oriented query optimizers [152, 
156], configuration tuners [200, 204], and access meth- 
ods [151, 193]. Although such ML-assisted optimiza- 
tions are powerful tools to improve the performance of 
DBMSs, it does not obviate the need for high-quality 
systems engineering. 


5 Conclusion 


We predict that what goes around with databases will 
continue to come around in upcoming decades. Another 
wave of developers will claim that SQL and the RM are 
insufficient for emerging application domains. People 
will then propose new query languages and data models 
to overcome these problems. There is tremendous value 
in exploring new ideas and concepts for DBMSs (it is 
where we get new features for SQL). The database re- 
search community and marketplace are more robust be- 
cause of it. However, we do not expect these new data 
models to supplant the RM. 

Another concern is the wasted effort of new projects 
reimplementing the same components that are not novel 
but necessary to have a production-ready DBMS (e.g., 
config handlers, parsers, buffer pools). To accelerate 
the next generation of DBMSs, the community should 
foster the development of open-source reusable compo- 
nents and services [112, 176]. There are some efforts to- 
wards this goal, including for file formats (see Sec. 3.3), 
query optimization (e.g., Calcite [104], Orca [186]), and 
execution engines (e.g., DataFusion [18], Velox [175]). 
We contend that the database community should strive 
for a POSIX-like standard of DBMS internals to accel- 
erate interoperability. 

We caution developers to learn from history. In other 
words, stand on the shoulders of those who came before 
and not on their toes. One of us will likely still be alive 
and out on bail in two decades, and thus fully expects to 
write a follow-up to this paper in 2044. 
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